|
A data lake is a large storage repository and processing engine. Data lakes provide "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs". The term was coined by James Dixon, Pentaho chief technology officer. Dixon initially used the term to contrast with "data mart", a smaller repository of interesting attributes extracted from the raw data. He wrote: "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

Dixon argued that data marts have several inherent problems and that data lakes are the optimal solution. He identified two shortcomings of data marts: "Only a subset of the attributes are examined, so only pre-determined questions can be answered." and "The data is aggregated so visibility into the lowest levels is lost."

These problems are often referred to as "siloing" and, in agreement with Dixon, PricewaterhouseCoopers says that data lakes could "put an end to data silos". In its study on data lakes, the firm notes that "Enterprises across industries are starting to extract and place data for analytics into a single, Hadoop based repository." It cites organizations such as UC Irvine Medical Center, Google, and Facebook that have embraced the data lake concept. PricewaterhouseCoopers also claims that cost is a major reason that organizations adopt data lakes.

== Examples of data lakes ==
Currently the only viable example of a data lake is Apache Hadoop.
Many companies also use cloud storage services such as Amazon S3, along with other open source tools such as Docker, as a data lake. There is also academic interest in the concept of data lakes; for example, Personal DataLake〔http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?reload=true&arnumber=7310733〕 is an ongoing project at Cardiff University to create a new type of data lake that manages the big data of individual users and provides a single point for collecting, organizing, and sharing personal data.〔http://www.researchgate.net/publication/283053696_Personal_Data_Lake_With_Data_Gravity_Pull〕
Excerpt source: Wikipedia, the free encyclopedia.
|